Capturing Out-of-Vocabulary Words in Arabic Text
Abstract
The increasing flow of information between languages has led to a rise in the frequency of non-native or loan words, where terms of one language appear transliterated in another. Dealing with such out-of-vocabulary words is essential for successful cross-lingual information retrieval. For example, techniques such as stemming should not be applied indiscriminately to all words in a collection, so foreign words need to be identified before any stemming takes place. In this paper, we investigate three approaches for the identification of foreign words in Arabic text: lexicons, language patterns, and n-grams. Our results show that lexicon-based approaches outperform the other techniques.
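As a rough illustration of two of the approaches named in the abstract, the sketch below flags a word as foreign when it is missing from a lexicon, and scores a word by how many of its character bigrams occur in native vocabulary. It is not the authors' method: the toy lexicon `NATIVE_LEXICON`, the prefix list `PREFIXES`, and all function names are assumptions made for this example only.

```python
from collections import Counter

# Hypothetical toy lexicon of native Arabic words (assumption for illustration);
# a real system would load a large lexicon.
NATIVE_LEXICON = {"كتاب", "مدرسة", "بيت", "قلم"}

# Assumed clitic prefixes stripped before lookup (illustrative, not exhaustive).
PREFIXES = ("وال", "بال", "ال", "و", "ب")


def strip_prefix(word):
    """Remove the first matching prefix, keeping at least two characters."""
    for prefix in PREFIXES:
        if word.startswith(prefix) and len(word) - len(prefix) >= 2:
            return word[len(prefix):]
    return word


def is_foreign_by_lexicon(word, lexicon=NATIVE_LEXICON):
    """Lexicon approach: foreign if neither the word nor its stripped form is listed."""
    return word not in lexicon and strip_prefix(word) not in lexicon


def char_ngrams(word, n=2):
    """Character n-grams of a word padded with '_' at both ends."""
    padded = f"_{word}_"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def ngram_profile(words, n=2):
    """Count the character n-grams seen in a list of native words."""
    counts = Counter()
    for w in words:
        counts.update(char_ngrams(w, n))
    return counts


def ngram_score(word, profile, n=2):
    """N-gram approach: fraction of the word's n-grams present in the native profile."""
    grams = char_ngrams(word, n)
    return sum(1 for g in grams if g in profile) / len(grams)


profile = ngram_profile(NATIVE_LEXICON)
print(is_foreign_by_lexicon("الكتاب"))   # native word with the definite article
print(is_foreign_by_lexicon("كمبيوتر"))  # transliteration of "computer"
print(ngram_score("كمبيوتر", profile))
```

A low n-gram score suggests a word whose character sequences are unusual for the native vocabulary; the threshold at which a word counts as foreign would have to be tuned on real data.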
Similar resources
A Rule-Based Arabic Text-To-Speech System Based On Hybrid Synthesis Technique
The field of speech synthesis, or Text-To-Speech, has expanded rapidly during the last few years due to the wide range of applications that require human-machine interaction. The Arabic language, the fourth most spoken language on the globe, has received the attention of researchers in the development of an intelligible and close-to-natural Text-To-Speech system. Most of the available Arabic Text-To-Spee...
Text Categorization with Semantic Commonsense Knowledge
Most text categorization research exploits the bag-of-words text representation. In this approach, however, all contextual information contained in the text is neglected. Therefore, capturing semantic similarity between text documents that share very little or even no vocabulary is not possible. In this paper we present an approach that combines well-established kernel text classifiers with external ...
On the Evaluation of Lexical Profiles in the Iranian High School and University EFL Text Books
Spoken Term Detection for Persian News of Islamic Republic of Iran Broadcasting
Islamic Republic of Iran Broadcasting (IRIB), one of the biggest broadcasting organizations, produces thousands of hours of media content daily. Accordingly, IRIB's archive is one of the richest archives in Iran, containing a huge amount of multimedia data. Monitoring this massive volume of data, and browsing and retrieving this archive, is one of the key issues for this broadcasting...
Buckwalter-based Lookup Tool as Language Resource for Arabic Language Learners
The morphology of the Arabic language is rich and complex; words are inflected to express variations in tense-aspect, person, number, and gender, while they may also appear with clitics attached to express possession on nouns, objects on verbs and prepositions, and conjunctions. Furthermore, Arabic script allows the omission of short vowel diacritics. For the Arabic language learner trying to u...
Publication date: 2006